Proactive Fault Monitoring in Enterprise Servers
Authors
Abstract
Proactive fault monitoring techniques are being developed, demonstrated on running servers, and productized to improve the reliability, availability, and serviceability of enterprise-class servers. A continuous system telemetry harness (CSTH) collects time series signals relating to the health of dynamically executing servers. These time series provide quantitative metrics for physical variables (distributed temperatures, voltages, and currents throughout the system), "soft" performance variables (loads, throughputs, queue lengths, bit error rates, etc.), and various quality-of-service (QoS) metrics. The CSTH signals are continuously archived to an offline circular file (the "Black Box Flight Recorder"), which is helping to identify and eliminate costly sources of No-Trouble-Found (NTF) events in Sun systems. Concurrently, the signals are processed in real time using advanced pattern recognition for proactive anomaly detection. Examples are presented of the CSTH coupled with pattern recognition for high-sensitivity predictive failure analysis, which is helping to improve component and system availability while reducing the incidence of NTF events, a costly serviceability/warranty issue in the enterprise computing industry.
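The two ideas in the abstract, a circular telemetry archive and real-time anomaly detection, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class `FlightRecorder`, the function `is_anomalous`, the signal name `cpu_temp_c`, and the sample readings are hypothetical, and the simple standard-deviation test is a stand-in for the "advanced pattern recognition" the abstract mentions, whose actual method is not described in this text.

```python
from collections import deque
from statistics import mean, stdev


class FlightRecorder:
    """Fixed-size circular archive of telemetry samples (hypothetical sketch)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest sample once full,
        # which gives the circular-file behaviour.
        self.samples = deque(maxlen=capacity)

    def record(self, timestamp, signal, value):
        self.samples.append((timestamp, signal, value))

    def history(self, signal):
        """Return the archived values for one signal, oldest first."""
        return [v for _, s, v in self.samples if s == signal]


def is_anomalous(history, value, threshold=3.0, min_samples=5):
    """Flag a value lying more than `threshold` sample standard deviations
    from the mean of the archived history (a simple stand-in detector)."""
    if len(history) < min_samples:
        return False  # not enough data to judge
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


# Usage: made-up temperature readings with one obvious excursion at the end.
recorder = FlightRecorder(capacity=8)
readings = [40.1, 39.9, 40.0, 40.2, 39.8, 40.1, 55.0]
alerts = []
for t, temp in enumerate(readings):
    # Check each new sample against the archive before recording it.
    if is_anomalous(recorder.history("cpu_temp_c"), temp):
        alerts.append((t, temp))
    recorder.record(t, "cpu_temp_c", temp)
```

With these readings only the final sample is flagged. A production harness would archive to persistent storage and monitor many correlated signals at once; this sketch keeps one signal in memory to show the data flow.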
Similar resources
An Architecture for an Adaptive Intrusion-Tolerant Server
We describe a general architecture for intrusion-tolerant enterprise systems and the implementation of an intrusion-tolerant Web server as a specific instance. The architecture comprises functionally redundant COTS servers running on diverse operating systems and platforms, hardened intrusion-tolerance proxies that mediate client requests and verify the behavior of servers and other proxies, an...
Improved Methods for Early Fault Detection in Enterprise Computing Servers Using SAS Tools
Advanced telemetry systems are being developed to collect and archive hundreds of system performance, throughput, quality-of-service (QoS), and physical variables for the purpose of enhancing the reliability, availability, serviceability, scalability, and security of business-critical enterprise computing servers. SAS software was chosen for this project because of the language's powerful codin...
A High-Performance and Fault-Tolerant Flow Control Method for Enterprise Servers
Network routers for parallel enterprise servers need fault tolerance as well as high performance to support a seamless value chain of e-business. This paper introduces a new cut-through flow control method, called the pathfinder, which provides an efficient restarting capability without the extra header delivery overhead for normal non-faulty routes. We present the router architecture and the de...
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not increase proportionally, it is becoming less efficient. Proactive FT avoids experiencing failures ...
A request-routing framework for SOA-based enterprise computing
Enterprises may use a service-oriented architecture (SOA) to provide a streamlined interface to their business processes. To scale up the system, each tier in a composite service usually deploys multiple servers for load distribution and fault tolerance. Such load distribution across multiple servers within the same tier can be viewed as horizontal load distribution. One limitation of this appr...